Lecture 14
An \(n\)-dimensional multivariate normal distribution with covariance \(\boldsymbol{\Sigma}\) (positive semidefinite) can be written as
\[ \underset{n \times 1}{\boldsymbol{y}} \sim N(\underset{n \times 1}{\boldsymbol{\mu}}, \; \underset{n \times n}{\boldsymbol{\Sigma}}) \]
\[ \begin{pmatrix} y_1\\ \vdots\\ y_n \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1\\ \vdots\\ \mu_n \end{pmatrix}, \, \begin{pmatrix} \rho_{11}\sigma_1\sigma_1 & \cdots & \rho_{1n}\sigma_1\sigma_n \\ \vdots & \ddots & \vdots \\ \rho_{n1}\sigma_n\sigma_1 & \cdots & \rho_{nn}\sigma_n\sigma_n \\ \end{pmatrix} \right) \]
For the \(n\)-dimensional multivariate normal given on the last slide, the density is given by
\[ (2\pi)^{-n/2} \; \det(\boldsymbol{\Sigma})^{-1/2} \; \exp{\left(-\frac{1}{2} \underset{1 \times n}{(\boldsymbol{y}-\boldsymbol{\mu})'} \underset{n \times n}{\boldsymbol{\Sigma}^{-1}} \underset{n \times 1}{(\boldsymbol{y}-\boldsymbol{\mu})}\right)} \]
and its log density is given by
\[ -\frac{n}{2} \log 2\pi - \frac{1}{2} \log \det(\boldsymbol{\Sigma}) - \frac{1}{2} \underset{1 \times n}{(\boldsymbol{y}-\boldsymbol{\mu})'} \underset{n \times n}{\boldsymbol{\Sigma}^{-1}} \underset{n \times 1}{(\boldsymbol{y}-\boldsymbol{\mu})} \]
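The course's own code is in R, but as an illustrative check (not from the slides) the log density above can be evaluated term by term and compared against SciPy's built-in implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_logpdf(y, mu, Sigma):
    """Log density of an n-dimensional multivariate normal, term by term."""
    n = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.solve(Sigma, diff)  # (y - mu)' Sigma^{-1} (y - mu)
    _, logdet = np.linalg.slogdet(Sigma)        # log det(Sigma)
    return -n / 2 * np.log(2 * np.pi) - logdet / 2 - quad / 2

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
y = np.array([0.3, -1.2])
print(np.isclose(mvn_logpdf(y, mu, Sigma),
                 multivariate_normal(mu, Sigma).logpdf(y)))  # True
```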
To generate draws from an \(n\)-dimensional multivariate normal with mean \(\underset{n \times 1}{\boldsymbol{\mu}}\) and covariance matrix \(\underset{n \times n}{\boldsymbol{\Sigma}}\),
Find a matrix \(\underset{n \times n}{\boldsymbol{A}}\) such that \(\boldsymbol{\Sigma} = \boldsymbol{A}\,\boldsymbol{A}^t\) (e.g. via the Cholesky decomposition), then set \(\boldsymbol{y} = \boldsymbol{\mu} + \boldsymbol{A}\,\boldsymbol{z}\) where \(\boldsymbol{z} \sim N(\boldsymbol{0},\, \boldsymbol{I}_n)\)
\[ \boldsymbol{\mu} = \begin{pmatrix}0 \\ 0\end{pmatrix} \qquad \boldsymbol{\Sigma} = \begin{pmatrix}1 & \rho \\ \rho & 1 \end{pmatrix}\]
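As a sketch of this recipe for the bivariate example above (in Python/NumPy rather than the course's R; the value \(\rho = 0.7\) is chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
rho = 0.7
mu = np.zeros(2)
Sigma = np.array([[1.0, rho], [rho, 1.0]])

A = np.linalg.cholesky(Sigma)           # lower triangular, Sigma = A A^t
z = rng.standard_normal((2, 10_000))    # i.i.d. N(0, 1) draws
y = mu[:, None] + A @ z                 # each column ~ N(mu, Sigma)

print(np.cov(y))  # approximately [[1, 0.7], [0.7, 1]]
```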
Proposition - For an \(n\)-dimensional multivariate normal with mean \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\), any marginal or conditional distribution of the \(y\)'s will also be (multivariate) normal.
Univariate marginal distribution: \[ y_i \sim N(\boldsymbol{\mu}_i,\,\boldsymbol{\Sigma}_{ii}) \]
Bivariate marginal distribution: \[ \boldsymbol{y}_{ij} \sim N\left( \begin{pmatrix}\boldsymbol{\mu}_i \\ \boldsymbol{\mu}_j \end{pmatrix},\; \begin{pmatrix} \boldsymbol{\Sigma}_{ii} & \boldsymbol{\Sigma}_{ij} \\ \boldsymbol{\Sigma}_{ji} & \boldsymbol{\Sigma}_{jj} \end{pmatrix} \right) \]
\(k\)-dimensional marginal distribution:
\[ \boldsymbol{y}_{i,\ldots,k} \sim N\left( \begin{pmatrix}\boldsymbol{\mu}_{i} \\ \vdots \\ \boldsymbol{\mu}_{k} \end{pmatrix},\; \begin{pmatrix} \boldsymbol{\Sigma}_{ii} & \cdots & \boldsymbol{\Sigma}_{i k} \\ \vdots & \ddots & \vdots \\ \boldsymbol{\Sigma}_{k i} & \cdots & \boldsymbol{\Sigma}_{k k} \end{pmatrix} \right) \]
If we partition the \(n\)-dimensions into two pieces such that \(\boldsymbol{y} = (\boldsymbol{y}_1,\, \boldsymbol{y}_2)^t\) then
\[ \underset{n \times 1}{\boldsymbol{y}} \sim N\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix},\, \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix} \right) \] \[ \begin{aligned} \underset{k \times 1}{~~\boldsymbol{y}_1~~} &\sim N(\underset{k \times 1}{~~~\boldsymbol{\mu}_1~~~},\, \underset{k \times k}{~~~\boldsymbol{\Sigma}_{11}~~~}) \\ \underset{n-k \times 1}{\boldsymbol{y}_2} &\sim N(\underset{n-k \times 1}{\boldsymbol{\mu}_2},\, \underset{n-k \times n-k}{\boldsymbol{\Sigma}_{22}}) \end{aligned} \]
then the conditional distributions are given by
\[ \begin{aligned} \boldsymbol{y}_1 ~|~ \boldsymbol{y}_2 = \boldsymbol{a} ~&\sim N(\boldsymbol{\mu_1} + \boldsymbol{\Sigma_{12}} \, \boldsymbol{\Sigma_{22}}^{-1} \, (\boldsymbol{a} - \boldsymbol{\mu_2}),~ \boldsymbol{\Sigma_{11}}-\boldsymbol{\Sigma_{12}}\,\boldsymbol{\Sigma_{22}}^{-1} \, \boldsymbol{\Sigma_{21}}) \\ \\ \boldsymbol{y}_2 ~|~ \boldsymbol{y}_1 = \boldsymbol{b} ~&\sim N(\boldsymbol{\mu_2} + \boldsymbol{\Sigma_{21}} \, \boldsymbol{\Sigma_{11}}^{-1} \, (\boldsymbol{b} - \boldsymbol{\mu_1}),~ \boldsymbol{\Sigma_{22}}-\boldsymbol{\Sigma_{21}}\,\boldsymbol{\Sigma_{11}}^{-1} \, \boldsymbol{\Sigma_{12}}) \end{aligned} \]
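These conditioning formulas can be checked numerically; the sketch below (Python/NumPy, not from the slides, using a small made-up 3-dimensional example) computes the conditional mean and covariance of \(y_1 \mid y_2\) and verifies the conditional covariance against the equivalent Schur-complement identity \(\left((\boldsymbol{\Sigma}^{-1})_{11}\right)^{-1}\):

```python
import numpy as np

# 3-d example partitioned as y1 = first component, y2 = last two
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])

mu1, mu2 = mu[:1], mu[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

a = np.array([1.2, -0.5])  # conditioning value for y2
cond_mean = mu1 + S12 @ np.linalg.solve(S22, a - mu2)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)
print(cond_mean, cond_cov)
```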
From Shumway,
A process, \(\boldsymbol{y} = \{y(t) ~:~ t \in T\}\), is said to be a Gaussian process if all possible finite dimensional vectors \(\boldsymbol{y} = (y_{t_1},y_{t_2},...,y_{t_n})^t\), for every collection of time points \(t_1, t_2, \ldots , t_n\), and every positive integer \(n\), have a multivariate normal distribution.
So far we have only looked at examples of time series where \(T\) is discrete (and evenly spaced & contiguous); it turns out things get a lot more interesting when we explore the case where \(T\) is defined on a continuous space (e.g. \(\mathbb{R}\) or some subset of \(\mathbb{R}\)).
Imagine we have a Gaussian process defined such that \(\boldsymbol{y} = \{y(t) ~:~ t \in [0,1]\}\),
Necessary to make some simplifying assumptions:
Stationarity
Simple(r) parameterization of \(\Sigma\)
More on these next week, but for now some common examples
Exponential covariance: \[ \Sigma(y_{t},y_{t'}) = \sigma^2 \exp\big(-|t-t'| \; l\,\big) \]
Squared exponential covariance (Gaussian): \[ \Sigma(y_{t},y_{t'}) = \sigma^2 \exp\big(-(|t-t'| \; l\,)^2\big) \]
Powered exponential covariance \(\left(p \in (0,2]\right)\): \[ \Sigma(y_{t},y_{t'}) = \sigma^2 \exp\big(-(|t-t'| \; l\,)^p\big) \]
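All three covariances above are members of the powered exponential family (with \(p = 1\) and \(p = 2\) as the special cases), so a single helper suffices; this is an illustrative Python/NumPy sketch, not the course's R code:

```python
import numpy as np

def pow_exp_cov(t1, t2, sigma2=1.0, l=1.0, p=2):
    """Powered exponential covariance: sigma^2 exp(-(|t - t'| l)^p).
    p=1 gives the exponential, p=2 the squared exponential (Gaussian)."""
    d = np.abs(np.subtract.outer(t1, t2))  # |t - t'| for all pairs
    return sigma2 * np.exp(-(d * l) ** p)

t = np.linspace(0, 1, 5)
Sigma_exp = pow_exp_cov(t, t, l=5, p=1)  # exponential
Sigma_sq = pow_exp_cov(t, t, l=5, p=2)   # squared exponential
print(np.diag(Sigma_sq))  # all ones, since Sigma(y_t, y_t) = sigma^2 = 1
```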
Letting \(\sigma^2 = 1\) and trying different values of the length scale \(l\),
Recall that for a stationary AR(1) process: \(\gamma(h) = \sigma^2_w \phi^{|h|}\) and \(\rho(h) = \phi^{|h|}\)
we can draw a somewhat similar picture about the decay of \(\rho\) as a function of distance.
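In fact the connection is exact, not just similar: taking \(l = -\log \phi\) (an identification not stated on the slide, but easy to verify) makes the exponential covariance's correlation \(\exp(-|h|\,l)\) coincide with the AR(1) autocorrelation \(\phi^{|h|}\) at integer lags. A quick Python check:

```python
import numpy as np

phi = 0.8
l = -np.log(phi)  # length scale matching the AR(1) decay rate
h = np.arange(0, 10)

ar1_corr = phi ** h        # AR(1): rho(h) = phi^|h|
exp_corr = np.exp(-h * l)  # exponential covariance correlation

print(np.allclose(ar1_corr, exp_corr))  # True
```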
Our example has 20 observations \(\left(\boldsymbol {y}_{obs} = \left(y(t_1), \ldots, y(t_{20})\right)'\right)\), which we would like to use as the basis for predicting \(y(t)\) at other values of \(t\) (say a regular sequence of values from 0 to 1), \(\left( \boldsymbol {y}_{pred} \right)\).
For now let's use a squared exponential covariance with \(\sigma^2 = 10\) and \(l=15\); the covariance is then given by:
\[ \Sigma(y_{t},y_{t'}) = \sigma^2 \exp\big(-(|t-t'| \; l\,)^2\big) \]
Our goal is then to obtain samples from \(\boldsymbol{y}_{pred} | \boldsymbol{y}_{obs}\) which has the following distribution,
\[\boldsymbol{y}_{pred} ~|~ \boldsymbol{y}_{obs} = \boldsymbol{y} ~\sim N(\boldsymbol{\Sigma}_{po} \, \boldsymbol{\Sigma}_{obs}^{-1} \, \boldsymbol{y},~ \boldsymbol{\Sigma}_{pred}-\boldsymbol{\Sigma}_{po}\,\boldsymbol{\Sigma}_{obs}^{-1} \, \boldsymbol{\Sigma}_{op})\]
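As a sketch of this conditional sampling step (Python/NumPy rather than the course's R; the observations here are simulated from the GP itself, and a small diagonal jitter is added purely for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(3)

def sq_exp_cov(t1, t2, sigma2=10.0, l=15.0):
    d = np.abs(np.subtract.outer(t1, t2))
    return sigma2 * np.exp(-(d * l) ** 2)

# Simulate 20 observations from a mean-zero GP on [0, 1]
t_obs = rng.uniform(0, 1, 20)
S_obs = sq_exp_cov(t_obs, t_obs) + 1e-8 * np.eye(20)  # jitter for stability
y_obs = rng.multivariate_normal(np.zeros(20), S_obs)

# Conditional distribution at a regular grid of prediction points
t_pred = np.linspace(0, 1, 101)
S_pred = sq_exp_cov(t_pred, t_pred)
S_po = sq_exp_cov(t_pred, t_obs)

cond_mean = S_po @ np.linalg.solve(S_obs, y_obs)
cond_cov = S_pred - S_po @ np.linalg.solve(S_obs, S_po.T)

# One posterior "path" through the observations
draw = rng.multivariate_normal(cond_mean, cond_cov + 1e-8 * np.eye(101))
```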
Now let's consider an exponential covariance model instead, where \(\sigma=10\) and \(l = \sqrt{15}\),
\[ \Sigma(y_{t},y_{t'}) = \sigma^2 \exp\big(-|t-t'| \; l\,\big) \]
We are still sampling from \(\boldsymbol{y}_{pred} ~|~ \boldsymbol{y}_{obs}\), all that has changed is the values of the covariance matrices.
What “paths” do we get with this covariance? How are they similar and how are they different from the squared exponential covariance?
For the squared exponential covariance \[ \begin{aligned} Cov(d) &= \sigma^2 \exp\left(-(l \cdot d)^2\right) \\ Corr(d) &= \exp\left(-(l \cdot d)^2\right) \end{aligned} \]
we would like to know: for a given value of \(l\), how far apart must two observations be for their correlation to fall below \(0.05\)?
\[ \begin{aligned} \exp\left(-(l \cdot d)^2\right) &< 0.05 \\ -(l \cdot d)^2 &< \log 0.05 \\ (l \cdot d)^2 &> -\log 0.05 \approx 3 \\ l \cdot d &> \sqrt{3} \\ d &> \sqrt{3} / l \end{aligned} \]
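A quick numerical check of this threshold (Python sketch, using the \(l = 15\) value from the earlier example):

```python
import numpy as np

l = 15
d = np.sqrt(3) / l               # distance at which correlation drops to ~0.05
corr = np.exp(-(l * d) ** 2)     # = exp(-3)
print(round(corr, 3))  # 0.05
```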
Family: gaussian
Links: mu = identity; sigma = identity
Formula: y ~ gp(t)
Data: d (Number of observations: 20)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Gaussian Process Terms:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sdgp(gpt) 1.60 0.63 0.80 3.28 1.02 282 374
lscale(gpt) 0.12 0.03 0.08 0.17 1.01 249 310
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -1.01 0.79 -2.82 0.45 1.01 1118 1075
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.19 0.05 0.11 0.32 1.01 941 1403
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
gp() serves a similar function to arma() within a brms model - it specifies the structure of the model, which is then translated into Stan.
The covariance function is specified using the cov argument; currently "exp_quad" is the default and the only supported covariance.
The covariance parameterization used differs slightly from what was given previously (but is equivalent),
\[ k(x_i, x_j) = \text{\{sdgp\}}^2 \exp(−||x_i − x_j||^2 / (2 \,\text{\{lscale\}}^2)) \]
Unlike our previous conditional samples, these predictions no longer pass exactly through each observation, so we get a better-behaved curve and much more reasonable intervals.
The model being fit by brms has one additional parameter that our previous example did not have - a nugget covariance (sigma in the brms summary).
So what does the covariance look like?
\[ k(x_i, x_j) = \text{\{sdgp\}}^2 \exp(−||x_i − x_j||^2 / (2 \,\text{\{lscale\}}^2)) + \text{\{sigma\}}^2 \mathbb{1}_{i=j} \]
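Plugging in the point estimates from the brms summary above (sdgp ≈ 1.60, lscale ≈ 0.12, sigma ≈ 0.19), this covariance can be sketched directly; the Python code below mirrors brms's parameterization for illustration and is not brms's internal implementation:

```python
import numpy as np

sdgp, lscale, sigma = 1.60, 0.12, 0.19  # point estimates from the brms summary

def brms_kernel(x1, x2):
    """k(x_i, x_j) = sdgp^2 exp(-||x_i - x_j||^2 / (2 lscale^2)) + sigma^2 1{i=j}"""
    d2 = np.subtract.outer(x1, x2) ** 2
    k = sdgp**2 * np.exp(-d2 / (2 * lscale**2))
    k[np.equal.outer(x1, x2)] += sigma**2  # nugget where x_i == x_j
    return k

x = np.linspace(0, 1, 4)
K = brms_kernel(x, x)
print(np.diag(K))  # sdgp^2 + sigma^2 on the diagonal
```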
Sta 344/644 - Fall 2023